Imbalanced Classification

Imbalanced classification addresses scenarios where the distribution of classes is heavily skewed, with one or more classes (minority) significantly underrepresented compared to others (majority). This is extremely common in real-world applications: fraud detection (fraudulent transactions might be <1% of total), medical diagnosis (rare diseases), spam filtering, anomaly detection, and fault detection in manufacturing. Standard classification algorithms trained on imbalanced data often develop a bias toward the majority class, achieving high overall accuracy while failing to identify minority class instances.

The accuracy paradox illustrates the core challenge: a classifier that always predicts the majority class achieves high accuracy but zero utility. If fraud occurs in 0.1% of transactions, predicting "not fraud" for everything yields 99.9% accuracy while catching zero fraud cases. This makes accuracy a misleading metric for imbalanced problems, necessitating alternative evaluation approaches focused on minority class performance.
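
To see the paradox in code, the sketch below scores an always-majority predictor; the 0.1% label vector and scikit-learn's DummyClassifier stand in for a real dataset and model:

```python
# Illustrative sketch of the accuracy paradox: a predictor that always
# outputs "not fraud" on labels with ~0.1% positives (synthetic data).
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.001, size=100_000)   # ~0.1% fraud labels
X = np.zeros((y.size, 1))                  # features are irrelevant here

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)
print("accuracy:", accuracy_score(y, y_pred))  # ~0.999
print("recall:  ", recall_score(y, y_pred))    # 0.0, catches no fraud
```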

Data-level approaches modify the training distribution. Random oversampling duplicates minority class examples, risking overfitting to specific instances. Random undersampling removes majority class examples, potentially discarding useful information. SMOTE (Synthetic Minority Oversampling Technique) generates synthetic minority examples by interpolating between existing minority instances and their nearest minority-class neighbors in feature space. Tomek Links removes the majority-class member of cross-class nearest-neighbor pairs, and Edited Nearest Neighbors discards instances whose class disagrees with most of their nearest neighbors; both clean up the decision boundary region. Hybrid methods (e.g., SMOTE followed by Tomek Link removal) combine oversampling and undersampling.
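
As a minimal sketch of the data-level route, the example below rebalances a skewed synthetic dataset with SMOTE; it assumes the third-party imbalanced-learn package, and the 95/5 class split is invented for illustration:

```python
# Minimal SMOTE sketch (assumes imbalanced-learn is installed).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic binary problem with roughly a 95:5 majority:minority split.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes minority examples by interpolating between each
# minority instance and one of its k nearest minority neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes are now balanced
```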

Algorithm-level approaches modify the learning algorithm itself. Cost-sensitive learning assigns higher misclassification costs to minority class errors, encoded through class weights in the loss function; many algorithms (SVM, logistic regression, tree-based methods) support class weighting. Threshold adjustment shifts the decision boundary by changing the probability cutoff for the positive class (e.g., 0.2 instead of the default 0.5), trading precision for recall. Ensemble methods like BalancedRandomForest or EasyEnsemble combine resampling with ensemble learning.
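
The sketch below combines both ideas in scikit-learn, fitting a class-weighted logistic regression and then lowering the positive-class threshold; the synthetic data and the 0.2 cutoff are illustrative choices, not tuned values:

```python
# Cost-sensitive weighting plus threshold adjustment (illustrative).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# "balanced" reweights each class by n_samples / (n_classes * class_count),
# so minority errors cost proportionally more in the loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Threshold adjustment: flag positives at P(y=1) >= 0.2 rather than the
# default 0.5, trading some precision for higher minority recall.
proba = clf.predict_proba(X_te)[:, 1]
y_pred = (proba >= 0.2).astype(int)
print("positives flagged:", int(y_pred.sum()), "of", len(y_pred))
```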

Evaluation metrics for imbalanced classification prioritize minority class performance. Precision-Recall curves and Average Precision focus on positive class performance. F1-score (or the F-beta score with an adjustable recall weight) balances precision and recall. Matthews Correlation Coefficient provides a balanced measure even under imbalance. ROC-AUC can still be useful but tends to look optimistic on highly imbalanced data, because the false positive rate is diluted by the abundance of negatives. Per-class metrics (sensitivity, specificity, and precision for each class) reveal performance disparities hidden by aggregate metrics.
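
The sketch below computes several of these metrics with scikit-learn; the toy labels and scores stand in for a fitted model's test-set output:

```python
# Imbalance-aware evaluation on toy predictions (synthetic stand-ins
# for a real model's test-set labels, scores, and hard predictions).
import numpy as np
from sklearn.metrics import (average_precision_score, classification_report,
                             fbeta_score, matthews_corrcoef, roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.05, size=2_000)                 # ~5% positives
proba = np.clip(0.4 * y_true + rng.uniform(0, 0.6, 2_000), 0, 1)
y_pred = (proba >= 0.5).astype(int)

print("average precision:", average_precision_score(y_true, proba))
print("ROC-AUC:          ", roc_auc_score(y_true, proba))
print("F2 (recall-heavy):", fbeta_score(y_true, y_pred, beta=2))
print("MCC:              ", matthews_corrcoef(y_true, y_pred))
# Per-class precision/recall exposes gaps that aggregates hide.
print(classification_report(y_true, y_pred, digits=3))
```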

Anomaly detection represents an extreme form of imbalance where the minority class is exceptionally rare and may not be well-represented in training data. This often requires specialized one-class classification or unsupervised anomaly detection methods rather than standard supervised approaches.
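
As one unsupervised option, the sketch below flags outliers with scikit-learn's IsolationForest on synthetic 2-D data; the contamination rate is an assumed value, not something estimated from real data:

```python
# Unsupervised anomaly detection sketch: IsolationForest trained
# without labels on data that is overwhelmingly "normal" (synthetic).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(1_000, 2))   # bulk of the data
outliers = rng.uniform(-6.0, 6.0, size=(10, 2))  # rare anomalies
X = np.vstack([normal, outliers])

# contamination is the assumed fraction of anomalies; predict() returns
# +1 for inliers and -1 for points flagged as anomalous.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)
print("flagged as anomalies:", int((labels == -1).sum()))
```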